feat(spam): prune confident-spam from published #133
Merged
The legacy import carried ~31.8k people, ~61% judged spam by the offline person-evaluations pass. Loading them all exceeded the in-memory heap budget on the standard node size, and spam accounts don't belong in the public civic-transparency dataset anyway. Spec defines: verdict aggregation (prune iff confident spam, no legit, and no project membership), the cascade-prune on `published`, idempotency, and the import → merge → eval → prune → push pipeline ordering. Exclusion happens in the data pipeline, not the runtime loader. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Implements specs/behaviors/spam-exclusion.md — prune confident-spam people from published so the runtime loads only real members and fits the node memory budget without a resize. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Re-runnable script that reads person-evaluations verdicts from the spam-detection branch and removes confident-spam people from published with cascaded deletes of their memberships / help-wanted-interest / person tag-assignments, nulling authorId on their project-updates. Project members are protected (real involvement overrides a spam verdict). Reads the ~54k evaluations via streaming git cat-file (not gitsheets) and applies the prune in one gitsheets transaction. Idempotent; --dry-run reports counts. No runtime/loader change — published simply ends up smaller. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
spam-detection.md: replace the "spam-purge is a future plan / filter on the read path" placeholder with the built prune step (command, rule, cascade, idempotency) and add the mandatory import → merge → eval → prune → push ordering warning — a re-import resurrects pruned spam until prune re-runs. cutover.md: add the prune as a required step after the legacy-import merge in both the T-1 and T-0 sequences. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Why
The full `published` import (~31.8k people, ~61% offline-flagged spam) no longer fits the in-memory heap budget on the 4 GB sandbox nodes: a cold boot OOM'd. Rather than double node cost to hold tens of thousands of spam accounts, prune confident-spam from `published` so the runtime loads only real members. Spam accounts also don't belong in the public, civic-transparency dataset.

Context: this came out of the boot-OOM incident (see #132). Pruning is the durable fix; the memory bump in #131 just bought headroom.
What
- `specs/behaviors/spam-exclusion.md`: the contract (verdict aggregation, prune + cascade scope, idempotency, pipeline ordering).
- `apps/api/scripts/prune-spam.ts`: re-runnable operator script. Reads `person-evaluations` verdicts from `spam-detection` (streaming `git cat-file`, not gitsheets; 54k records), aggregates per person, and cascade-prunes confident-spam from `published` in one gitsheets transaction.
- `plans/spam-prune.md`: the plan (links perf: investigate in-memory state heap footprint (~60x on-disk-to-heap expansion) #132).
- `spam-detection.md` + `cutover.md`: updated so the reimport process always runs the prune (with the resurrection-on-reimport ordering warning).

No runtime/loader change: `published` simply ends up smaller.
The rule
Prune a person iff: ≥1 `spam` verdict at confidence ≥ 0.8, and no `legit` verdict at any confidence, and no project membership (real involvement overrides a spam verdict). Cascade deletes their memberships / help-wanted-interest / person tag-assignments; nulls `authorId` on their project-updates.
Validation (on a throwaway clone)
Spot-checked 10 pruned accounts: 9 unambiguous bulk-created commercial spam; the 1 with a real project membership is now protected by the membership clause.
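
For reviewers, the prune rule, including the membership clause that protects the tenth spot-checked account, can be sketched as a pure function. This is a minimal illustration and not the code in `prune-spam.ts`: the `Verdict` shape, the `selectPrunable` name, and the idea of passing project members as a set of IDs are all assumptions made for the example.

```typescript
// Hypothetical record shape: the real person-evaluations schema is not shown in this PR.
type Verdict = {
  personId: string;
  label: "spam" | "legit";
  confidence: number;
};

// Threshold taken from the rule stated in this PR.
const SPAM_CONFIDENCE_THRESHOLD = 0.8;

// Returns the IDs that satisfy all three clauses of the rule:
//   1. at least one spam verdict at confidence >= 0.8
//   2. no legit verdict at any confidence
//   3. no project membership
function selectPrunable(
  verdicts: Verdict[],
  projectMemberIds: Set<string>,
): Set<string> {
  const confidentSpam = new Set<string>();
  const hasLegit = new Set<string>();

  for (const v of verdicts) {
    if (v.label === "legit") {
      // Any legit verdict vetoes the prune, regardless of its confidence.
      hasLegit.add(v.personId);
    } else if (v.confidence >= SPAM_CONFIDENCE_THRESHOLD) {
      confidentSpam.add(v.personId);
    }
  }

  const prunable = new Set<string>();
  confidentSpam.forEach((id) => {
    // Real involvement (a project membership) overrides a spam verdict.
    if (!hasLegit.has(id) && !projectMemberIds.has(id)) {
      prunable.add(id);
    }
  });
  return prunable;
}
```

Under these assumptions, a person with a 0.99 spam verdict and a project membership is kept, as is a person with a 0.9 spam verdict and a 0.2 legit verdict. A person whose only verdict is spam at exactly 0.8 is pruned.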
Ordering (documented, mandatory)
`published` is the merge target of `legacy-import` (full raw snapshot). A re-import/merge re-adds pruned spam, so the pipeline must always end with prune: import → merge → (re-)eval → prune → push.

🤖 Generated with Claude Code
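
The ordering constraint and the idempotency claim can be seen in a toy in-memory model. This is a sketch under stated assumptions, not the gitsheets transaction the script uses: the `Published` shape and the `mergeSnapshot` and `pruneCascade` helpers are hypothetical, and only two of the cascaded record types are modeled.

```typescript
// Toy stand-in for the published dataset; the real sheets and fields may differ.
type Published = {
  people: Map<string, { id: string }>;
  helpWantedInterest: { personId: string }[];
  projectUpdates: { id: string; authorId: string | null }[];
};

// Models the legacy-import merge: the full raw snapshot is written back,
// which re-adds anyone a previous prune removed.
function mergeSnapshot(published: Published, snapshot: { id: string }[]): void {
  for (const person of snapshot) {
    published.people.set(person.id, person);
  }
}

// Removes the given people, deletes their dependent records, and nulls
// authorId on their project updates. Returns how many people were removed.
function pruneCascade(published: Published, prunableIds: Set<string>): number {
  let removed = 0;
  prunableIds.forEach((id) => {
    if (published.people.delete(id)) removed++;
  });
  published.helpWantedInterest = published.helpWantedInterest.filter(
    (row) => !prunableIds.has(row.personId),
  );
  published.projectUpdates = published.projectUpdates.map((update) =>
    update.authorId !== null && prunableIds.has(update.authorId)
      ? { ...update, authorId: null }
      : update,
  );
  return removed;
}
```

In this model, calling `pruneCascade` twice in a row removes nothing the second time, which is the idempotency property. Calling `mergeSnapshot` after a prune puts the pruned IDs back in `people`, which is why the prune has to run after every merge and before the push.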